
BN layer, non-linear activation layer, and max-pooling layer. We omit these for the sake of simplicity. The output $\hat{a}_{\text{out}}$ is then binarized to $b_{\hat{a}_{\text{out}}}$ by the sign function. The fundamental objective of BNNs is to calculate $\hat{w}$ such that it stays as close as possible to its value before binarization, thereby minimizing the binarization effect. We therefore define the reconstruction error following [77] as

$$\mathcal{L}_R(\hat{w}, \beta) = \|\hat{w} - \beta\, b_{\hat{w}}\|_2^2. \tag{4.23}$$
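To make the reconstruction objective concrete, the following is a minimal PyTorch sketch of Eq. 4.23. The per-output-channel scale and the closed-form choice $\beta = \mathbb{E}[|\hat{w}|]$ are assumptions borrowed from common 1-bit CNN practice such as [77], and the function names are illustrative, not part of this section's method.

```python
import torch

def binarize_weights(w_hat: torch.Tensor):
    """Binarize latent real-valued weights w_hat with a per-output-channel
    scale beta (an assumption following XNOR-Net-style binarization)."""
    b_w = torch.sign(w_hat)  # b_w = sign(w_hat)
    # Closed-form scale minimizing ||w_hat - beta * b_w||_2^2 per output
    # channel is the mean absolute value of that channel's weights.
    beta = w_hat.abs().mean(dim=(1, 2, 3), keepdim=True)
    return b_w, beta

def reconstruction_loss(w_hat, beta, b_w):
    """L_R(w_hat, beta) = ||w_hat - beta * b_w||_2^2 (Eq. 4.23)."""
    return ((w_hat - beta * b_w) ** 2).sum()

# Usage on a 4D conv weight (out_channels, in_channels, kH, kW):
w_hat = torch.randn(16, 8, 3, 3, requires_grad=True)
b_w, beta = binarize_weights(w_hat)
loss = reconstruction_loss(w_hat, beta, b_w)
```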

Based on the above derivation, the vanilla direct BNAS [36, 114] can be defined as

$$\max_{\hat{w}\in\mathcal{W},\, \hat{\alpha}\in\mathcal{A},\, \beta\in\mathbb{R}^+} f_b(\hat{w}, \hat{\alpha}, \beta), \tag{4.24}$$

where $b_{\hat{w}} = \mathrm{sign}(\hat{w})$ is used for inference and $\hat{\alpha}$ is a neural architecture with binary weights.

The prior direct BNAS [36] learns the binary architecture from such an objective as

$$\max_{\hat{w}\in\mathcal{W},\, \hat{\alpha}\in\mathcal{A},\, \beta\in\mathbb{R}^+} \tilde{f}_b(\hat{w}, \hat{\alpha}, \beta) = \sum_{n=1}^{N} \hat{p}_n(\hat{w}, \hat{\alpha}, \beta)\,\log(\hat{p}_n(X)), \tag{4.25}$$

where we use notation similar to that of Eq. 4.21. Equation 4.25 means that the vanilla direct BNAS focuses only on the binary search space under the supervision of the cross-entropy loss, which is less effective because the search process is not exhaustive [24].
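As a rough illustration of how such an objective is evaluated in practice, the sketch below scores one candidate binary architecture by its negated cross-entropy on held-out data, so that a higher value corresponds to the maximization in Eqs. 4.24 and 4.25. All names (`binary_fitness`, `model`, `loader`) are hypothetical, and the model is assumed to already run its forward pass with binarized weights $b_{\hat{w}} = \mathrm{sign}(\hat{w})$.

```python
import torch
import torch.nn.functional as F

def binary_fitness(model: torch.nn.Module, loader, device="cpu"):
    """Estimate the fitness f_b of one candidate binary architecture as
    the negated average cross-entropy on held-out data (higher is better,
    matching the max in Eq. 4.24). A sketch; names are illustrative."""
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            logits = model(x.to(device))  # forward uses sign(w_hat) weights
            total += -F.cross_entropy(logits, y.to(device),
                                      reduction="sum").item()
            n += y.numel()
    return total / n
```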

4.4.2 Redefining the Child-Parent Framework for Network Binarization

Network binarization computes neural networks with 1-bit weights and activations that fit the full-precision network, which can significantly compress CNNs. Prior work [287] usually investigates the binarization problem by exploring the full-precision model to guide the optimization of the binarized model. Based on this investigation, we reformulate NAS-based network binarization as a Child-Parent model, as shown in Fig. 4.12. The Child and Parent models are the binarized network and its full-precision counterpart, respectively.

Conventional NAS is inefficient because of the complicated reward computation during network training, where a structure is usually evaluated only after training converges. Some methods instead evaluate a cell during network training. As [292] points out, the best operation in the early stages is not necessarily the final optimal one; however, the worst operation in the early stages usually keeps performing badly, and this phenomenon becomes more significant as training goes on. Based on this observation, we propose a simple yet effective operation-removing process, which is the crucial task of the proposed CP model; a toy sketch is given below.
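The sketch below illustrates the operation-removing step only: each candidate operation on an edge carries a performance score accumulated during training, and the worst one is dropped. The dictionary layout, the scores, and the per-epoch schedule in the usage comment are illustrative assumptions, not the section's exact procedure.

```python
def remove_worst_operation(op_scores: dict) -> str:
    """Drop the candidate operation with the lowest accumulated
    performance score and return its name (a toy sketch)."""
    worst = min(op_scores, key=op_scores.get)
    del op_scores[worst]
    return worst

# Usage: after each training epoch, prune the worst operation on every
# edge until a single operation per edge remains (hypothetical schedule).
edge_ops = {"skip_connect": 0.61, "sep_conv_3x3": 0.74, "max_pool_3x3": 0.42}
removed = remove_worst_operation(edge_ops)  # -> "max_pool_3x3"
```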

Intuitively, two main aspects should be considered when defining a reasonable performance evaluation measure: the representation difference between the Child and the Parent, and how well the Child can handle its task independently. Based on this analysis, we introduce the Child-Parent framework for binary NAS, which defines the objective as

$$\begin{aligned}
\hat{w}, \hat{\alpha}, \beta &= \operatorname*{argmin}_{\hat{w}\in\mathcal{W},\, \hat{\alpha}\in\mathcal{A},\, \beta\in\mathbb{R}^+} \mathcal{L}_{\text{CP-NAS}}\big(\tilde{f}^P(w, \alpha),\, \tilde{f}^C_b(\hat{w}, \hat{\alpha}, \beta)\big) \\
&= \operatorname*{argmin}_{\hat{w}\in\mathcal{W},\, \hat{\alpha}\in\mathcal{A},\, \beta\in\mathbb{R}^+} \tilde{f}^P(w, \alpha) - \tilde{f}^C_b(\hat{w}, \hat{\alpha}, \beta),
\end{aligned} \tag{4.26}$$

where $\tilde{f}^P(w, \alpha)$ denotes the performance of the real-valued Parent model as predefined in Eq. 4.21, and $\tilde{f}^C_b$ is further defined as $\tilde{f}^C_b(\hat{w}, \hat{\alpha}, \beta) = \sum_{n=1}^{N} \hat{p}_n(\hat{w}, \hat{\alpha}, \beta)\,\log(\hat{p}_n(X))$ following Eq. 4.25.
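The sketch below expresses Eq. 4.26 directly as the Parent-Child performance gap, with the two fitness values assumed to be precomputed via Eqs. 4.21 and 4.25. It is a minimal illustration under our own naming; the optimization over $\hat{w}$, $\hat{\alpha}$, and $\beta$ is omitted.

```python
import torch

def cp_nas_loss(f_parent: torch.Tensor, f_child: torch.Tensor) -> torch.Tensor:
    """L_CP-NAS of Eq. 4.26: the gap between the real-valued Parent's
    fitness (Eq. 4.21) and the binarized Child's fitness (Eq. 4.25).
    Minimizing this over the Child's weights, architecture, and scales
    pushes the Child's performance toward the Parent's."""
    return f_parent - f_child

# Usage with precomputed fitness values (placeholder numbers):
f_p = torch.tensor(0.95)      # Parent fitness ~f^P(w, alpha)
f_c = torch.tensor(0.80)      # Child fitness  ~f^C_b(w_hat, alpha_hat, beta)
loss = cp_nas_loss(f_p, f_c)  # -> 0.15; a smaller gap means a better Child
```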

As shown in Eq. 4.26, we propose $\mathcal{L}_{\text{CP-NAS}}$ to estimate the performance of candidate